SYSTEM AND METHOD FOR DETECTING 
AND MANAGING HPC NODE FAILURE 

Inventors: James D. Ballew et al. 

Date Filed: April 15, 2004

Attorney's Docket: 064747.1015 

Sheet 1 of 10 

FIG. 1 [reference numeral: 140]




Sheet 2 of 10 





Sheet 3 of 10 








Sheet 4 of 10 





FIG. 2C 





Sheet 5 of 10 






FIG. 3A
Labels: ETHERNET MANAGEMENT INTERFACE; LINUX BIOS; DUAL DDR MEMORY 340b; INFINIBAND HCA 335a; N-PORT SWITCH; INFINIBAND HCA 335b
Other reference numerals: 315, 345

FIG. 3B
Labels: MOTHERBOARD WITH PCI EXPRESS; DUAL DDR MEMORY 340a, 340b; HYPER TRANSPORT; HT/PCI BRIDGE 330a, 330b; PCI-EXPRESS; HCA 335a, 335b; N-PORT SWITCH; GB ETHERNET MANAGEMENT INFRASTRUCTURE
Other reference numerals: 315, 320a, 320b, 325, 345




Sheet 6 of 10 






FIG. 3C
Labels: DAUGHTER BOARD; DUAL CACHE COHERENT INTERFACES; FPGA 350a, 350b; 30Gb/sec; DUAL DDR MEMORY 340a, 340b; CPU 320a, 320b; HYPER TRANSPORT; HT/PCI BRIDGE 330a, 330b; PCI-EXPRESS; HCA 335a, 335b; 24-PORT SWITCH; FABRIC PORTS 1 to 16
Other reference numerals: 315, 325, 345




Sheet 7 of 10






[Drawing with reference numeral 400a: window with menu bar (File, Edit, View, Go, Favorites, Help), Address bar, and status bar (Internet zone); panels labeled COMPUTE NODES and I/O NODES]




Sheet 8 of 10 







FIG. 4B [reference numeral 400b: window with menu bar (File, Edit, View, Go, Favorites, Help), Address bar, and status bar (Internet zone)]



FIG. 5 [reference numeral 500]
Labels: PHYSICAL MANAGER; VIRTUAL MANAGER; JOB SCHEDULER; JOB; POLICIES
Other reference numerals: 505, 510, 515, 520, 521, 522, 523, 524, 525




Sheet 9 of 10 






FIG. 6 [reference numeral 600]

605  RECEIVE JOB SUBMISSION FROM USER
610  SELECT GROUP BASED ON USER
615  COMPARE USER TO GROUP ACL
620  SELECT VIRTUAL CLUSTER BASED ON GROUP
625  RETRIEVE POLICY BASED ON JOB SUBMISSION
630  DETERMINE DIMENSIONS OF JOB
635  ARE ENOUGH NODES AVAILABLE? [YES]
640  DETERMINE EARLIEST AVAILABLE SUBSET OF NODES IN VIRTUAL CLUSTER
645  ADD JOB TO JOB QUEUE UNTIL SUBSET IS AVAILABLE
650  DYNAMICALLY DETERMINE OPTIMUM SUBSET FROM AVAILABLE NODES
655  SELECT SUBSET OF NODES FROM SELECTED VIRTUAL CLUSTER
660  ALLOCATE SELECTED NODES FOR JOB
665  EXECUTE JOB USING ALLOCATED SPACE BASED ON JOB PARAMETERS AND RETRIEVED POLICY
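The FIG. 6 steps can be read as a single submission routine. The Python sketch below is illustrative only and is not taken from the drawings: the names `VirtualCluster`, `Group`, `submit_job`, and the job fields `policy` and `dimensions` are hypothetical, the "optimum subset" is replaced by simply taking the lowest free node ids, and it is assumed that the YES branch of 635 leads to 650 while the other branch leads to 640 and 645. Step 640 (earliest available subset) is reduced to plain queueing.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VirtualCluster:
    """Hypothetical virtual cluster: free node ids plus a job queue."""
    free_nodes: set
    queue: deque = field(default_factory=deque)


@dataclass
class Group:
    """Hypothetical group record: access control list and its virtual cluster."""
    acl: set
    cluster: VirtualCluster


def submit_job(user, job, groups_by_user, policies):
    """Walk the FIG. 6 steps for one submission (605); returns (status, nodes)."""
    group = groups_by_user.get(user)              # 610: select group based on user
    if group is None or user not in group.acl:    # 615: compare user to group ACL
        return ("rejected", [])
    cluster = group.cluster                       # 620: select virtual cluster
    policy = policies.get(job["policy"], {})      # 625: retrieve policy
    x, y, z = job["dimensions"]                   # 630: determine dimensions of job
    needed = x * y * z
    if len(cluster.free_nodes) < needed:          # 635: enough nodes available?
        cluster.queue.append(job)                 # 640/645: wait in the job queue
        return ("queued", [])
    # 650/655: lowest free ids stand in for the "optimum subset" (assumption)
    chosen = sorted(cluster.free_nodes)[:needed]
    cluster.free_nodes -= set(chosen)             # 660: allocate selected nodes
    job["policy_settings"] = policy               # 665: execution would use these
    return ("running", chosen)
```

For example, with eight free nodes, a job of dimensions (2, 2, 1) is started on four nodes, and a following job of dimensions (2, 2, 2) is queued because only four nodes remain.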




Sheet 10 of 10 




FIG. 7 [reference numeral 700]

705  SORT QUEUE BASED ON PRIORITY
710  DETERMINE NUMBER OF AVAILABLE NODES IN VIRTUAL CLUSTER
715  SELECT FIRST JOB FROM JOB QUEUE
720  DYNAMICALLY DETERMINE OPTIMUM SHAPE OF SELECTED JOB
725  ARE ENOUGH NODES AVAILABLE FOR SELECTED JOB? [branches: YES, NO]
730  DYNAMICALLY ALLOCATE NODES FOR SELECTED JOB
735  RECALCULATE NUMBER OF AVAILABLE NODES IN VIRTUAL CLUSTER
740  EXECUTE JOB ON ALLOCATED NODES
745  SELECT NEXT JOB IN JOB QUEUE
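The FIG. 7 steps describe one pass over a prioritized job queue. The following Python sketch is a hedged illustration, not the drawing's own method: the function name `run_queue` and the job fields `name`, `priority`, and `nodes` are hypothetical, a larger priority number is assumed to mean "run first", the job's "optimum shape" is reduced to a plain node count, and it is assumed that both the NO branch of 725 and the flow after 740 continue at 745.

```python
def run_queue(queue, available):
    """One pass over the queue; returns (names of started jobs, nodes left).

    `queue` is a list of dicts with hypothetical keys 'name', 'priority',
    and 'nodes'; `available` is the free-node count from step 710.
    """
    started = []
    # 705: sort queue based on priority (higher number first is an assumption)
    ordered = sorted(queue, key=lambda job: job["priority"], reverse=True)
    for job in ordered:                    # 715 / 745: first job, then next job
        needed = job["nodes"]              # 720: node count stands in for shape
        if needed <= available:            # 725: enough nodes for selected job?
            available -= needed            # 730 / 735: allocate, then recalculate
            started.append(job["name"])    # 740: execute job on allocated nodes
    return started, available
```

Under these assumptions a lower-priority job can start ahead of a larger, higher-priority job that does not currently fit, because the loop moves on to the next job instead of blocking.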



FIG. 8 [reference numeral 800]

805  DETERMINE THAT NODE HAS FAILED
810  REMOVE NODE FROM VIRTUAL CLUSTER
820  DETERMINE OTHER NODES ASSOCIATED WITH JOB
825  KILL JOB ON ALL NODES
830  DEALLOCATE NODES
835  RETRIEVE POLICY AND PARAMETERS FOR KILLED JOB
840  DETERMINE OPTIMUM SUBSET OF NODES IN VIRTUAL CLUSTER BASED ON POLICY AND PARAMETERS
845  DYNAMICALLY ALLOCATE SUBSET OF NODES
850  EXECUTE JOB ON ALLOCATED NODES
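The FIG. 8 steps describe recovery after a node failure. The Python sketch below is a minimal illustration under stated assumptions: the function name `handle_node_failure` and the job fields `nodes` and `size` are hypothetical, the retrieved policy and parameters are reduced to a required node count, and the "optimum subset" is replaced by the lowest free node ids. How the failure is detected (805) is outside the sketch.

```python
def handle_node_failure(failed, free_nodes, jobs):
    """Restart jobs that were using `failed`; returns names of restarted jobs.

    `free_nodes` is a set of idle node ids (mutated in place); `jobs` maps a
    job name to {'nodes': set of node ids, 'size': required node count}.
    All of these names are hypothetical.
    """
    free_nodes.discard(failed)                        # 810: remove node from cluster
    restarted = []
    for name, job in jobs.items():
        if failed not in job["nodes"]:
            continue                                  # job not affected by the failure
        others = job["nodes"] - {failed}              # 820: other nodes of the job
        job["nodes"] = set()                          # 825: kill job on all nodes
        free_nodes |= others                          # 830: deallocate nodes
        size = job["size"]                            # 835: parameters of killed job
        if len(free_nodes) >= size:
            chosen = set(sorted(free_nodes)[:size])   # 840: placeholder for optimum
            free_nodes -= chosen                      # 845: allocate subset
            job["nodes"] = chosen                     # 850: execute on allocated nodes
            restarted.append(name)
    return restarted
```

In this sketch a killed job that cannot be placed immediately is left with no nodes; a fuller version would presumably hand it back to the job queue.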



